#message=FALSE means some output messages aren't printed in the knitted HTML document

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE) 
#message=FALSE and warning=FALSE ensure any error or output messages aren't shown in the output HTML file
#echo=TRUE ensures that the code used to produce outputs can be accessed through the output file

#loading any necessary packages
library(dplyr)
library(tidyverse) 
library(knitr)
library(readr)
library(ggplot2)
library(bbmle)
library(mizer) 
library(Hmisc)
library(lme4)
library(RColorBrewer)
library(viridis)
library(hexbin)
library(ggpattern) 
library(sjPlot)      
load("stomach_dataset.Rdata")
#loading the dataset to be used in analysis (the dataset is already named stom_df)

df <- stom_df %>% 
  transmute(haul_id, ices_rectangle, year, pred_species, 
                            pred_weight_g, pred_length_cm, prey_weight_g, 
                            prey_type = prey_funcgrp, indiv_prey_weight = prey_ind_weight_g, 
                            prey_count, n_stomachs, 
                            no._prey_per_stmch = prey_count/n_stomachs, ppmr) 
#creating a new data set called 'df' which only contains certain selected columns
df <- df[df$indiv_prey_weight != 0, ]
df <- df[df$pred_weight_g != 0, ]
df <- df[df$indiv_prey_weight != Inf, ]
#Removes any points where the prey weight = Inf or 0, or the predator weight = 0

df<- df[!(is.na(df$year)),]
df<- df[!(is.na(df$ices_rectangle)),]
#Removes any points where the value is NA

df$lppmr <- log(df$ppmr)
df$lprey_weight <- log(df$indiv_prey_weight)
df$lpred_weight <- log(df$pred_weight_g)
#Adds columns which take the log value of PPMR, individual prey weight and predator weight to the main data set
renamed_df = df %>% 
  mutate(pred_species = replace(pred_species, 
                                pred_species == "Micromesistius poutassou", "Blue Whiting")) %>%
  mutate(pred_species = replace(pred_species, pred_species == "Gadus morhua", "Cod")) %>% 
  mutate(pred_species = replace(pred_species, pred_species == "Limanda limanda", "Common Dab")) %>%
  mutate(pred_species = replace(pred_species, 
                                pred_species == "Merluccius merluccius", "European Hake")) %>%  
  mutate(pred_species = replace(pred_species, pred_species == "Melanogrammus aeglefinus", "Haddock")) %>%  
  mutate(pred_species = replace(pred_species, pred_species == "Clupea harengus", "Herring")) %>% 
  mutate(pred_species = replace(pred_species, pred_species == "Trachurus trachurus", "Horse Mackerel")) %>%  
  mutate(pred_species = replace(pred_species, pred_species == "Scomber scombrus", "Mackerel")) %>%
  mutate(pred_species = replace(pred_species, pred_species == "Lepidorhombus whiffiagonis", "Megrim")) %>%
  mutate(pred_species = replace(pred_species, pred_species == "Lophius piscatorius", "Monkfish")) %>% 
  mutate(pred_species = replace(pred_species, pred_species == "Trisopterus esmarkii", "Norway Pout")) %>%  
  mutate(pred_species = replace(pred_species, pred_species == "Pleuronectes platessa", "Plaice")) %>%
  mutate(pred_species = replace(pred_species, pred_species == "Trisopterus minutus", "Poor Cod")) %>%  
  mutate(pred_species = replace(pred_species, pred_species == "Solea solea", "Sole")) %>% 
  mutate(pred_species = replace(pred_species, pred_species == "Sprattus sprattus", "Sprat")) %>%
  mutate(pred_species = replace(pred_species, pred_species == "Merlangius merlangus", "Whiting"))
#Renaming the predator species from their latin names (e.g. Gadus morhua) to their more common names (e.g. Cod)
#Creates a new data frame (called 'renamed_df') with these replaced names

species_list <- c("Blue Whiting", "Cod", "Common Dab", "European Hake", "Haddock", "Herring",
                     "Horse Mackerel", "Mackerel", "Megrim", "Monkfish", "Norway Pout",
                      "Plaice", "Poor Cod", "Sole", "Sprat", "Whiting")
#Creates an array called 'species_list', which is list of the predator species we are focusing on in this project

renamed_df <- renamed_df[renamed_df$pred_species %in% species_list, ]
#Removes any observations of predator species not in the species_list, i.e. they are irrelevant data points for this project

1 Introduction

This data set is formed from recordings taken by multiple ships between the years 1886-2016.

Fish were taken from some location, and their individual stomach contents recorded. Prey of (roughly) the same size found in a single predator’s stomach were recorded as part of a single observation. This gives the number of prey that make up a single observation for some single predator stomach, and is called ‘no._prey_per_stmch’.

Using similar logic,‘prey_weight_g’ is the total weight of prey in a single observation, and ‘indiv_prey_weight’ is then the approximate weight of one single prey individual found in the stomach, found using:

\[ \text{indiv_prey_weight} = \frac{\text{prey_weight_g}}{\text{no._prey_per_stmch}} \]

There are some other column names which are useful to note:

  • ices_rectangle accounts for the location of observations, where each area that the points are sampled in is split up into a number of smaller rectangles and each is given a unique name
  • haul_id is a identification tag given to each individual ship which recorded observations
  • The weight of each individual prey (indiv_prey_weight) and each predator (pred_weight_g) are measured in grams
  • PPMR stands for ‘predator prey mass ratio’, and is calculated using:

\[ \text{PPMR} = \frac{\text{predator mass}}{\text{prey mass}} \] (where the prey mass is the ‘indiv_prey_weight’, representing the prey mass of a single prey in the stomach of a single observation)

2 Distribution of prey type eaten for each predator

ggplot(renamed_df, aes(x=prey_type, fill=prey_type)) +
   geom_bar_pattern(stat = "count",
                    pattern_color = "white",
                    pattern_fill = "white",
                    aes(pattern = prey_type, pattern_angle = prey_type)) +
  facet_wrap(~renamed_df$pred_species, scale="free_y") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position="none") +
  labs(title = "Distribution of prey types for each predator species", x="Prey type",
       y="Number of observations")

#geom_bar_pattern() creates a bar chart from the specified data, where each category has a different pattern used to fill it
#Using different patterns as well as different colours to fill the bars ensures the data can be understood if looked at in grey scale, and makes the document accessible for colour blind people
#facet_wrap() means an individual bar chart is created for each individual predator species
#'theme()' adjusts the labels along the x-axis so they lie slightly angled to the vertical direction so they are easy to read
#legend.position="none" means that there is no key to explain what each bar shows, as the x-axis is very clear in describing each bar

These are individual bar charts showing the distribution of the type of prey each predator eats.

The prey types are:

  • Benthos: organisms that live on/near the bottom of a body of water
  • Fish: aquatic, craniate (have a skull), gill-bearing animals that lack limbs with digits (i.e. their limbs don’t have toes or fingers). Fish are a type of nekton, but for this data set fish are recorded as separate entities.
  • Nekton: the actively swimming aquatic organisms that can resist a strong current of water (the classification of nekton for this data set excludes any fish)
  • Zooplankton: animal types of plankton (where plankton are aquatic organisms that are unable to swim effectively against currents)
  • Other: miscellaneous types of aquatic animals that don’t fit into the other categories, including sand, debris, crustaceans and unclassified digested remains

We can see that some predators eat a range of prey types (e.g. Whiting), while others prefer to eat a single prey type in abundance (e.g. European Hake and Monkfish mostly consume fish as their prey).

3 Distribution of prey masses

ggplot(renamed_df, aes(x=lprey_weight)) +
  geom_density(aes(weight = no._prey_per_stmch), color="red") +
  labs(title = "Density of log(prey mass) observations", x="log(prey mass)", 
      y="Number density of observations") + 
  facet_wrap(~pred_species, scale="free_y")   

This plot shows that different predator species consume differently sized distributions of prey. For example, some species have a ‘peak’ at roughly a single size of prey (such as Horse Mackerel and Norway Pout), and some species show a wider range of prey masses which they consume, such as:

  • Cod
  • Sprat, and
  • Megrim

4 Most common prey masses

renamed_df$haul_id_short <- gsub("\\-.*", "", renamed_df$haul_id)
renamed_df$haul_id_short <- gsub("_", "", renamed_df$haul_id_short)
#the haul_id values are renamed to be shortened versions of the ship names (e.g. CLYDE) rather than the complete id (e.g. CLYDE-1935-6)
ggplot(data = renamed_df, aes(lprey_weight, no._prey_per_stmch)) + 
      labs(title = "Number of prey per predator stomach v. log(prey mass) ", 
       x="log(prey mass)", y="Number of prey per predator stomach") + 
      geom_point(size=0.5)

#geom_point() adds individual points which show individual recorded points
#size=0.5 defines the size of each point on this graph

This graph is looking at the distribution of the mass of prey recorded, i.e. looking at what is the most common prey mass over all prey species. It is a graph over all the data points, so includes recordings from all the ships involved.

There are some potentially interesting results, so results from different ships were plotted on individual curves to further look into these results.

ggplot (data = renamed_df, aes(x=lprey_weight, y=no._prey_per_stmch, color=haul_id_short)) + 
  labs(title = "Number of prey per observation v. log(prey mass) - separated by all ship names", 
       x="log(prey mass)", y="Number of prey per observation") + 
  geom_point(size=0.02) +
  theme(strip.text = element_text(size = 7), legend.position="none", 
        plot.title = element_text(size=22), axis.title = element_text(size=22)) + 
  facet_wrap(~renamed_df$haul_id_short, scale="free_y")

Taking a closer look at some of these ships gives the plots below:

interesting_haul <- filter(renamed_df, 
                           haul_id_short=='CLYDE'|haul_id_short=='END04'|haul_id_short=='LUC'|
                             haul_id_short=='HIDDINK'|haul_id_short=='EXCmacDATSTO815'|
                             haul_id_short=="Excmacdatsto815error")
#interesting_haul is a new data frame which only contains values recorded by ships which gave "interesting" looking datapoints

haul_list_interesting <- unique(interesting_haul$haul_id_short)
#creates an array containing every individual (renamed) ship name that we are interested in

ggplot (data = interesting_haul, aes(x=lprey_weight, y=no._prey_per_stmch, 
                                     color=haul_id_short)) + 
  labs(title = "Number of prey per observation v. log(prey mass) 
       - separated by ship names, for a selection of 6 ships",
       x="log(prey mass)", y="Number of prey per observation") + 
  geom_point(size=0.02) +
  theme(strip.text = element_text(size = 10), legend.position="none") + 
  facet_wrap(~interesting_haul$haul_id_short, scale="free_y")

These six graphs show a range of unusual looking points:

  1. \(y \propto e^{-x}\) relation for END04 (i.e. no. per stomach is proportional to \(\frac{1}{\text{prey mass}}\))
  2. lots of observations for single prey masses for LUC and CLYDE
  3. lots of the same no. of prey per observation observations for HIDDINK
  4. the EXCmacDATSTO815 and Excmacdatsto815error seem to represent exactly the same data

There are a number of other ships which also provided potentially erroneous data, and the total number of datapoints affected are as below:

type1 <- length((filter(renamed_df, haul_id_short=='END04'))$haul_id_short)
#counting the number of observations recorded for each sort of error or interesting relation

type2 <- length((filter(renamed_df, haul_id_short=='CLION09' | haul_id_short=='LUC' | haul_id_short=='UNITY' | haul_id_short=='CLYDE'))$haul_id_short)

type3 <- length((filter(renamed_df, haul_id_short=='BEAVER'|haul_id_short=='BRUCELLA'|haul_id_short=='BULLEN'|haul_id_short=='CIROL04'|haul_id_short=='CIROL10'|haul_id_short=='CLION03'|haul_id_short=='CLION07'|haul_id_short=='CLION12'|haul_id_short=='COREL02'|haul_id_short=='COREL06'|haul_id_short=='CORY04'|haul_id_short=='CYPRIS'|haul_id_short=='END02'|haul_id_short=='END12'|haul_id_short=='HEINCKE147'|haul_id_short=='HIDDINK'|haul_id_short=='OITHONA'|haul_id_short=='STRANDLINE'|haul_id_short=='TELLINA'))$haul_id_short)

type4 <- length((filter(renamed_df, haul_id_short=='EXCmacDATSTO815'|haul_id_short=='Excmacdatsto815error'))$haul_id_short)

ship_erroneous <- data.frame("Graph_Type"=c("y proportional to exp(-x)", "Single prey mass"
                                      , "Same number of prey per observation", "Same data"),
                        "Number_of_observations"=c(type1, type2, type3, type4))
#creates a data frame which shows the number of observations of each error 'type'

kable(ship_erroneous)
Graph_Type Number_of_observations
y proportional to exp(-x) 2093
Single prey mass 554
Same number of prey per observation 26854
Same data 5618
#creates a table of the ship_erroneous data frame as an output

In total, 35445 out of the total 267431 observations lie within these potentially erroneous groups. This is: \[ \frac{35445}{267431} \times 100 = 13.25389 \% \]

This means that approximately \(13 \%\) of the data set comes from potentially unreliable sources. Though we will not remove or alter the data set to account for these potentially erroneous data points in this project, for further analysis it may be sensible to remove the data points these six ships recorded so that any results are as reliable as possible.

5 Prey and predator mass relation

We are attempting to find a link between the predator mass and the prey mass, and \(\log()\) of each axis is used to see the proportionality of the axes. The mathematics behind this is as follows:

\[ \log (\text{prey mas}) = m \times \text{log(predator mass)} + c , \]

where m is the gradient of the slope and c is the y-intercept (thinking of this graph as a linear model).

Taking the exponential of both sides,

\[ \text{prey mass} = \exp({{m}\times{\log(\text{predator mass})}}) + \log(D) , \] where \(\log(D) = c\).

Finally,

\[ \text{prey mass} = \text{predator mass}^m \times D \]

Hence, we want \(m = 1\) to show that predator mass is linearly proportional to prey mass (as expected). However, to simply show that the two variables are dependent on each other, we are looking for instances where \(m \neq 0\)

#message=FALSE means some output messages aren't printed in the knitted HTML document

ggplot(data = renamed_df, aes(lpred_weight, lprey_weight)) + 
  labs(title = "log(prey mass) v. log(predator mass) plot", 
       x="log(predator mass)", y="log(prey mass)") + 
  geom_hex(bins = 20) +
  scale_fill_viridis(direction=-1, option="C") +
  stat_smooth(method='lm', se=FALSE, colour="blue") +
  theme_classic()

# theme_classic() removes the grey background of the plot to make it easier to read and understand
# stat_smooth adds a line of best fit plotted through the points
# method='lm' ensures it is a straight line, i.e. a linear model
# se=FAlSE creates only a single line, with no error of margin included

#geom_hex creates "bins" in the data where the density is calculated over
#bins=20 means that there are 20 "bins" in the horizontal direction and 20 in the vertical
#scale_fill_viridis creates a density colour scale, where the least dense area is yellow and the most dense is purple

slope <- lm(lprey_weight ~ lpred_weight, data = renamed_df)

cat("Slope of the log(prey mass) v. log(predator mass) line of best fit:", 
        summary(slope)$coefficients[2],
    "\n",
    "Standard error of this line of best fit:", summary(slope)$coefficients[4])
## Slope of the log(prey mass) v. log(predator mass) line of best fit: 0.6712536 
##  Standard error of this line of best fit: 0.001915732
#'cat()' prints some output message as purely the characters included in the '()' section

# summary(slope)$coefficients creates an array containing the coefficients of the linear model of the relationship between two axes, including the slope of the LOBF and the standard error of this line
# The first item of the array gives out the y-intercept of the line, and the second item of the array is the gradient

For this graph, the slope is equal to \(0.6712536 \pm 0.001915732\). This is not equal to \(1\), and therefore this plot supports the assumption that the predator and prey masses are dependent on each other in some way.

For this graph, the most dense area (the area with the greatest number of points in a single hexagonal ‘bin’) is purple, and the least dense area is in yellow. Therefore, we can see that there are lots of observations around the point where \(\log(\text{predator mass}) = 6\) and \(\log(\text{prey mass}) = 0\), and near where \(\log(\text{predator mass}) = -1\) and \(\log(\text{prey mass}) = -5\). However, we can see that the line of best fit (in blue) isn’t plotted directly through these most dense areas, and therefore we can’t claim that this line of best fit is very reliable.

This graph is plotted over the entire data set and includes values from all of the available predator species. Instead, we will separate the plots by predator species to see if there is some proportionality between the predator and prey mass for each specific species.

5.1 Prey v. predator mass plot, separated by predator species

ggplot(data = renamed_df, aes(lpred_weight, lprey_weight)) + 
  labs(title = "log(prey mass) v. log(predator mass), separated by predator species", 
       x="log(predator mass)", y="log(prey mass)") +
  geom_point(size=0.01, colour="red") + 
  facet_wrap(~pred_species, scale="free_y") + 
  theme(strip.text = element_text(size = 10)) +
  geom_smooth(method='lm', colour="black")

#the grey bands around the lines of best fit represent the standard error of this line

5.2 Gradients of the plots for each predator species

i <- 1
species_grad = c()
uncert=c()
#setting i=1 for the while loop
#and creating an empty vector called 'species_grad' to hold data

while(i<=length(species_list)){
  species_df <- renamed_df %>% filter(pred_species == fixed(species_list[i]))
  
  grad <- coef(lm((species_df$lprey_weight)~(species_df$lpred_weight)))
  species_grad[i] <- grad[2]
  
  preyvpred.lm <- lm(lprey_weight ~ lpred_weight, data = species_df)
  uncert[i] <- summary(preyvpred.lm)$coefficients[4]
  
  i=i+1
}

#'while' means that this function repeats through the length of 'species_list' until all predator species are accounted for
#i=i+1 increases the value of i each time through this loop to motivate this repeat
#this creates a data frame (species_df) containing each individual predator species, then calculates the gradient of the log(pred mass) v. log(prey mass) graph for this individual predator species, then inserts this gradient value to the ith value of an array named 'species_grad'
#this also creates a linear model of log(prey mass) against log(predator mass) for each individual predator species, and extracts the standard error of the line of best fit for this linear model, storing this in an array called 'uncert'

pvp_grads <- data.frame("Predator_Species"=c(species_list),
                        Gradient=c(species_grad),
                        Error=c(uncert))

kable(pvp_grads)
Predator_Species Gradient Error
Blue Whiting -0.2433567 0.0912603
Cod 0.7269038 0.0019058
Common Dab 0.6359695 0.0519601
European Hake 0.7677213 0.0271726
Haddock 0.7004115 0.0069827
Herring 0.1507461 0.0228279
Horse Mackerel 0.5960227 0.0450872
Mackerel 0.2379693 0.0245757
Megrim 1.1524540 0.0593195
Monkfish 0.8168852 0.0269028
Norway Pout 0.9604312 0.0371339
Plaice 0.7014262 0.0155969
Poor Cod 1.0845397 0.1052513
Sole 0.3803803 0.1200316
Sprat 0.5874927 0.1286451
Whiting 0.6945304 0.0028509
#creates a new data frame name 'pvp_grads', with one column named 'pred_species' and the other named 'gradient'
#each row contains the name of some individual predator species and its associated gradient
#kable prints the information in the pvp_grads

Here, we are splitting up the graphs over each individual predator species to look and see if there is a proportionality between the predator and prey masses for each specific predator species.

We are wanting gradient \(\neq 0\) to suggest dependence between the predator and prey mass, and specifically gradient \(= 1\) to suggest that they are linearly proportional.

The gradient \(\neq 0 \pm\) (their standard error) for all of the predator species. Therefore, the assumption that the predator and prey masses are dependent on each other is a reasonable one and it holds for every predator species.

Having proportionality between the predator and prey masses allows us to use the PPMR (predator prey mass ratio) for further analysis in this project.

6 PPMR for individual species

Here, we are looking for the most common log(PPMR) value for each individual species. This will be found by plotting the density of the log(PPMR) observations, and weighting these observations based on two different variables. By assuming that these are normally distributed relations, there should be a ‘peak’ point of log(PPMR) which is where the most common log(PPMR) value lies. We assume this will be unique for different predator species, i.e. each predator has a preferred PPMR value, and hence each predator species has a preferred prey size relative to the mass of the predator that it most frequently consumes.

6.1 Weighted by prey biomass

These plots are weighted by the prey biomass. This means that we are looking at the distribution of prey biomass across values of log(PPMR), so observations are counted up according to their biomass. This means that a single observation which includes prey of a large biomass will represent a large density, and a single observation of prey of a small biomass will be shown as having a small density,

ggplot(data = renamed_df, aes(x=lppmr)) + 
  labs(title = "Density plot of log(PPMR), weighted by biomass density of prey", 
       x="log(PPMR)", y="Biomass density of prey") +
  facet_wrap(~renamed_df$pred_species, scale="free_y") + 
  theme(strip.text = element_text(size = 5)) +
  geom_density(aes(weight = prey_weight_g), colour="red")

#using facet_wrap for the variable (pred_species) means the data is separated into individual plots for each predator species
#geom_density means area under the curve = 1 (i.e. the graph is normalised)
#weight=prey_weight_g means the points are 'weighted' by the column weight (biomass) of each individual prey

6.2 Weighted by number of prey

These plots are weighted by the number pf prey per observation. This means that data points with a larger number of prey per observation are prioritised/weighted more than those with a smaller number of prey per observation.

ggplot(data = renamed_df, aes(x=lppmr)) + 
  labs(title = "Density plot of log(PPMR), weighted by number density of prey", 
       x="log(PPMR)", y="Number density of prey") +
  facet_wrap(~renamed_df$pred_species, scale="free_y") + 
  theme(strip.text = element_text(size = 5)) +
  geom_density(aes(weight = no._prey_per_stmch), colour="green")

#weight=no._prey_per_stmch means the points are 'weighted' by the number of prey per observation

By looking at the two differently weighted graphs, we can see that the mean of the plots are ‘shifted’ by some amount (e.g. the Blue Whiting when weighted by prey biomass has a mean log(PPMR) of roughly 4.8, and when weighted by number of prey this mean becomes roughly 7.5).

7 Specific PPMR calculations by different weightings for Herring species

Here, we are only looking at observations relating of the predator type Herring.

herringDF <- renamed_df %>% 
    filter(pred_species == fixed("Herring"))
#creates a separate data set (called 'herringDF') only containing observations for the predator species 'Herring'

7.1 Weighted by prey biomass

The curve with points weighted by the prey biomass is plotted in blue, and a normal curve (also weighted by prey biomasss) is plotted over the top as a dashed black line.

bio_herringmean = weighted.mean(herringDF$lppmr, w = herringDF$prey_weight_g, na.rm = TRUE)
bio_herringSD = sqrt(wtd.var(herringDF$lppmr, w = herringDF$prey_weight_g, na.rm = TRUE))

#weighted.mean gives the arithmetic mean of log(PPMR), where datapoints are weighted by the prey weight of observations
#similarly, wtd.rvar is the variance of log(PPMR), where the datapoints are also weighted by the prey weight
#standard deviation is defined as the square root of the variance
#na.rm=TRUE means any rows with missing values (values that equal 'na') aren't included in the mean/variance calculations, but are instead ignored

ggplot(data = herringDF, aes(x=lppmr)) + 
          labs(title = "Density plot of log(PPMR) for Herring,
               weighted by prey biomass", 
               x="log(PPMR)",y="Biomass density of observations") +
          geom_density(aes(weight = prey_weight_g), colour="red") + 
          theme(plot.title = element_text(size=15), axis.text.y = element_text(size=10)) + 
          stat_function(fun = dnorm, 
                        linetype="dashed",
                        args= with(herringDF, c(mean = bio_herringmean, sd = bio_herringSD))) +
          xlim(-5,17)

#axis.text.y adjusts the size of the axis labels on the y-axis so they are most easily readable
#stat_function adds a normal distribution curve (fun=dnorm) with a mean and standard deviation equal to what was calculated earlier for this data set
#linetype makes the line for the normal distribution into a dotted (black) line
#xlim sets limits for the x-axis so that all the necessary data points can be seen

cat("Mean value of log(PPMR), weighted by biomass of prey:", bio_herringmean,
    "\n",
    "Standard deviation of this:", bio_herringSD)
## Mean value of log(PPMR), weighted by biomass of prey: 6.781942 
##  Standard deviation of this: 2.370106
#"\n| creates a line break so that the outputs appear on different lines in the knitted html document

7.2 Weighted by number of prey

The curve with points weighted by the number of prey per observation is plotted in red, and a normal curve (also weighted by the number of prey per observation) is plotted over the top as a dotted black line.

no_herringmean = weighted.mean(herringDF$lppmr, w = herringDF$no._prey_per_stmch, na.rm = TRUE)
no_herringSD = sqrt(wtd.var(herringDF$lppmr, w = herringDF$no._prey_per_stmch, na.rm = TRUE))
#the mean and variance are both weighted by the number of prey

ggplot(data = herringDF, aes(x=lppmr), group=1) + 
          labs(title = "Density plot of log(PPMR) for Herring, 
               weighted by number of prey per observation",
               x="log(PPMR)", y="Number density of observations") +
          geom_density(aes(weight = no._prey_per_stmch), colour="green") + 
          theme(plot.title = element_text(size=15), axis.text.y = element_text(size=10)) +
          stat_function(fun = dnorm, 
                        linetype = "dotted",
                        args= with(herringDF, c(mean = no_herringmean, sd = no_herringSD))) +
          xlim(0,25)

#the xlim is different to the graph above because these data points lie over a slightly different log(PPMR) range

cat("Mean value of log(PPMR), weighted by the number of prey per observation:", no_herringmean,
    "\n",
    "Standard deviation of this:", no_herringSD)
## Mean value of log(PPMR), weighted by the number of prey per observation: 13.68691 
##  Standard deviation of this: 2.473435

7.3 Combined graph

ggplot(data = herringDF, aes(x=lppmr), group=1) + 
          labs(title = "Density plot of log(PPMR) for Herring,
               with various weightings",
               x="log(PPMR)", y="Number/biomass density of observations",
               color="Line", linetype="Line") +
          geom_density(aes(weight = no._prey_per_stmch, colour="prey biomass",
                           linetype="prey biomass")) + 
          geom_density(aes(weight = prey_weight_g, colour="number of prey per observation",
                           linetype="number of prey per observation")) + 
          stat_function(fun = dnorm, 
                        args= with(herringDF, c(mean = no_herringmean, sd = no_herringSD)),
                        aes(colour="number of prey per observation - normal curve",
                            linetype="number of prey per observation - normal curve")) +
          stat_function(fun = dnorm,
                        args= with(herringDF, c(mean = bio_herringmean, sd = bio_herringSD)),
                        aes(colour="prey biomass - normal curve",
                            linetype="prey biomass - normal curve")) +
          theme(plot.title = element_text(size=20)) +
          scale_color_manual(values=c('green', 'black', 'red', 'black')) +
          scale_linetype_manual(values = c("solid", "dashed", "solid", "dotted")) +
          xlim(-5,25)

#scale_color_manual manually adds a key to the graph to describe what the differently coloured lines represent
#scale_linetype_manual manually adds a key to the graph to describe what the differently shaped lines represent (e.g. some lines are dotted, dashed or solid)
#each curve is given a name using 'aes(colour="")', and these are then given a specific colour using 'values='

This is a graph with both the the distribution of Herring log(PPMR) as weighted by prey biomass and number of prey plotted over each other (along with appropriately weighted normal distribution curves plotted over the top of each).

  • Prey biomass weighted mean \(\log\)(PPMR) value: \(6.781942 \pm 2.370106\) (where 2.370106 is the weighted by the prey biomass standard deviation for this curve).

and

  • Number of prey weighted mean \(\log\)(PPMR) value: \(13.68691 \pm 2.473435\) (the number of prey weighted standard deviation is 2.473435).

The mathematics of shifted means for these different weightings is:

\[ \text{weighted mean}_{\text{expected prey biomass}} = \text{weighted mean}_{\text{no. of prey }} - (\text{standard deviation}_{\text{no. of prey }})^2 \]

Hence, the expected result is:

\[ \text{weighted mean}_{\text{expected prey biomass}} = 13.68691 - (2.473435)^2 = 7.569029 \]

The \(\text{weighted mean}_{\text{expected prey biomass}}\) lies within the standard error (\(\pm 2.370106\)) of the prey biomass weighted mean as given by the plot above. Therefore, the shifting mean equations are relatively accurate for this data set with these weightings.

So, we can continue with the assumption that there is a difference in the mean of \(\log\)(PPMR) of the square of the standard deviation when we weight by number of prey per observation and prey biomass.

8 Predator mass against PPMR

8.1 Plotting PPMR against predator mass, separated by predator species

ggplot(data=renamed_df, aes(lpred_weight, lppmr)) + 
  geom_point(size=0.001, colour="cornflowerblue") +
  labs(title = "log(PPMR) v. log(predator mass) plot", 
       x="log(Predator mass)", y="log(PPMR)") + 
  stat_smooth (method='lm', colour="black") + 
  facet_wrap(~pred_species, scale="free_y")

Graphing \(\log\)(predator mass) v. \(\log\)(PPMR) for individual predator species to see if the predator mass related to the PPMR.

\[ \log(\text{PPMR}) = m \times \log (\text{predator mass}) + c \]

where m is the gradient (assuming a linear model between the variables) and c is the y-intercept. Then,

\[ \text{PPMR} = (\text{predator mass})^m + D \]

(where \(c = \log(D)\))

To prove that log(predator mass) is not related to the log(PPMR), we want the two to not be proportional and hence we are looking for slope, \(m = 0\). This would mean that the PPMR is independent of the predator mass for a certain species (as assumed).

j <- 1
species_grad_ppmr = c()
uncert_ppmr = c()

while(j<=length(species_list)){
  species_df <- renamed_df %>% filter(pred_species == fixed(species_list[j]))
  
  grad <- coef(lm((species_df$lppmr)~(species_df$lpred_weight)))
  species_grad_ppmr[j] <- grad[2]
  
  ppmrvpred.lm <- lm(lppmr ~ lpred_weight, data = species_df)
  uncert_ppmr[j] <- summary(ppmrvpred.lm)$coefficients[4]
  
  j=j+1
}

ppmrvp_grads <- data.frame("Predator_species"=c(species_list),
                        Gradient=c(species_grad_ppmr),
                        "Standard_Error" = c(uncert_ppmr))

kable(ppmrvp_grads)
Predator_species Gradient Standard_Error
Blue Whiting 1.2433567 0.0912603
Cod 0.2730962 0.0019058
Common Dab 0.3640305 0.0519601
European Hake 0.2322787 0.0271726
Haddock 0.2995885 0.0069827
Herring 0.8492539 0.0228279
Horse Mackerel 0.4039773 0.0450872
Mackerel 0.7620307 0.0245757
Megrim -0.1524540 0.0593195
Monkfish 0.1831148 0.0269028
Norway Pout 0.0395688 0.0371339
Plaice 0.2985738 0.0155969
Poor Cod -0.0845397 0.1052513
Sole 0.6196197 0.1200316
Sprat 0.4125073 0.1286451
Whiting 0.3054696 0.0028509

\(\pm\) their standard error, the species Megrim, Norway Pout, Poor Cod have a gradient equal to \(0\) (to 1 significant figure). This is not many of the possible species,, so we have not supported the idea that the assumption of independence between the \(\log(\text{PPMR})\) and \(\log(\text{predator mass})\) can be upheld for this data set.

8.2 Different weightings of the line of best fit

Herring_df <- renamed_df %>% filter(pred_species == fixed("Herring"))
#creating a data frame only containing observations where the predator species is herring

ggplot(data=Herring_df, aes(lpred_weight, lppmr)) + 
  geom_point(size=0.5) +
  labs(title = "log(PPMR) v. log(predator mass) plot: Herring", 
       x="log(Predator mass)", y="log(PPMR)", color="Weightings", linetype="Weightings") + 
  stat_smooth(aes(weight=no._prey_per_stmch, color='by number of prey in stomach',  
                  linetype='by number of prey in stomach'), 
              method='lm', se=FALSE) +
  stat_smooth(aes(weight=prey_weight_g, color='by prey biomass', linetype='by prey biomass'), 
              method='lm', se=FALSE) +
  stat_smooth(aes(color='no weighting', linetype='no weighting'), method='lm', se=FALSE) +
  stat_summary(fun.data=mean_se, geom="linerange") +
  scale_color_manual(values=c("green", "red", "blue")) +
  scale_linetype_manual(values = c("dashed", "dotted", "solid")) +
  theme_classic()

# se=FALSE doesn't add an area of error around the LOBF 
# fun.data=mean_se calculates the mean and standard error for each point
# linerange draws a point range between an upper and lower limit for the line, and the mean for the point (using the values calculated in 'fun.data=mean_se')

#weighted by number of prey per observation
perstmch_weighting <- lm(lppmr ~ lpred_weight, data = Herring_df, 
                         weight=Herring_df$no._prey_per_stmch)

#weighted by prey biomass
biomass_weighting <- lm(lppmr ~ lpred_weight, data = Herring_df, 
                        weight=Herring_df$prey_weight_g)

#no weighting
no_weighting <- lm(lppmr ~ lpred_weight, data = Herring_df)

cat("Weighted by the number of prey per observation:", 
    "\n",
    "Slope:", summary(perstmch_weighting)$coefficients[2],
    "\n",
    "Standard error:", summary(perstmch_weighting)$coefficients[4],
    "\n")
## Weighted by the number of prey per observation: 
##  Slope: 1.310481 
##  Standard error: 0.04267642
cat("Weighted by the prey biomass:", 
    "\n",
    "Slope:", summary(biomass_weighting)$coefficients[2],
    "\n",
    "Standard error:", summary(biomass_weighting)$coefficients[4],
    "\n")
## Weighted by the prey biomass: 
##  Slope: 1.350805 
##  Standard error: 0.06601733
cat("With no weightings in the calculations:", 
    "\n",
    "Slope:", summary(no_weighting)$coefficients[2],
    "\n",
    "Standard error:", summary(no_weighting)$coefficients[4],
    "\n")
## With no weightings in the calculations: 
##  Slope: 0.8492539 
##  Standard error: 0.02282792

Looking at just the predator species ‘Herring’. There are three LOBF (line of best fit), weighted by different variables, each with individual values for their gradient.

For observations of the predator species Poor Cod the gradient is equal to \(1\) (to 1 significant to figure and \(\pm\) (their standard error)) when the line of best fit is weighted by either the prey biomass or the number of prey in the stomach. This is compelling evidence for some dependency between the PPMR and the predator mass.

8.3 Residuals of the log(predator mass) against log(PPMR) plot

We are now proceeding with the idea that the \(\log(\text{predator mass})\) and \(\log(\text{PPMR})\) are dependent on one another in some way, hence the predator mass is dependent on the PPMR.

`Heteroscedasticity’ is when the residuals (error terms) are unequally scattered. However, residuals with a constant variance will show a density that is approximately normally distributed. Therefore, if residuals do not show a normal distribution (and instead show heteroscedasticity), then it is unclear whether or not the method of a linear relationship is appropriate to fit the data set.

For the arbitrarily selected predator species Herring, we get the following plots which all discuss the idea of normally distributed residuals.

herring_df <-  renamed_df %>% filter(pred_species == fixed("Herring"))

herring_model <- lm(lppmr ~ lpred_weight, data=herring_df) 
res <- resid(herring_model)
predicted <- predict(herring_model)
#lm() fits a regression models for log(PPMR) against log(predator mass)
#resid() creates a list of the residual of each individual data point
#predict() gives the predicted value of a data point by using the previous behaviour and patterns of the data to determine possible future values

8.3.1 With residuals of each point to the predicted values of each point

ggplot(herring_df, aes(x = lpred_weight, y = lppmr)) +
  geom_smooth(method = "lm", se = FALSE, colour="black") +  
  geom_segment(aes(xend = lpred_weight, yend = predicted), colour="red", size=0.1) +
  geom_point(size=0.5) +
  labs(title='The log(predator mass) against log(PPMR) graph, 
       with lines from the individual points to the line of best fit: Herring', 
       x='log(predator mass)', y='log(PPMR)')

#showing the residuals of each point to the line of best fit
#geom_smooth() creates the regression line of the regression model from lm()
#geom_segment() draws a line from each point to the regression line 

This plot shows how distributed the points are in comparison to the line of best fit. For a ‘well fitting’ line of best fit, we would expect the residuals to be approximately normally distributed. As discussed earlier, the residual of a point is equal to: \[ \text{Residual} = \text{actual value} - \text{predicted/fitted value} , \] where the predicted/fitted value is equal to the value as predicted by the line of best fit.

In order to display a linear relationship between the axes, we would expect that the residuals of this plot would be normally distributed. This would meant that the data points are equally scattered around the line of best fit, and therefore the two axes would show linearity.

8.3.2 Residuals v. fitted values

To further investigate this linearity, we plot the residuals (as calculated from the linear model) against the fitted values as predicted by the line of best fit.

ggplot(herring_model, aes(x = .fitted, y = .resid)) +
  geom_point(colour="red", size=0.5) +
  geom_hline(yintercept = 0, colour="grey") +
  labs(title='Plot of the residuals against the fitted values: Herring',
       x='Fitted Values', y='Residuals') +
  geom_smooth() + 
  theme_classic()

#this plots the residuals of each data point, and adds a line across at res=0 to help visualise the data
#.resid creates a list of the residuals of each individual data point

This graph plots the residual of each observation against its fitted value (as predicted by the line of best fit). This plot is used to detect unequal residual variances and any anomalous points.

We can see that the residuals generally increase in value as the fitted values increase, which suggests that the residuals are potentially not normally distributed.

8.3.3 Density of residuals

ggplot(herring_model, aes(x = .resid)) +
  geom_density(colour="red") +
  labs(title='Density plot of the residuals: Herring', x='Residuals', y='Density') +
  theme_classic()

#plots the density of the residuals of the lppmr v lpred mass graph for Herring

By plotting the density of the residuals, we can see that for the predator species Herring the residuals do not look normally distributed. This suggests that the residuals are unequally scattered, and hence we continue with the idea that \(\log{(\text{PPMR})}\) and \(\log{(\text{predator mass})}\) are independent of each other because there appears to be no normal distribution of residuals and thus no linearity between the axes.

8.3.4 qqplot

qqnorm(res, xlab = "Theoretical Residuals", ylab = "Sample Residuals",
       main="Normal Q-Q plot of residuals - for the species Herring")
qqline(res, col="blue")

#plot a Q-Q plot to determine if the residuals follow a normal distribution
#for a normal distribution, we would expect the data points to fall along roughly a straight line at a 45 degree angle

Laslty, a qqplot (quantile-quantile plot) shows if two data sets come from the same distribution. For this data set, we plot the actual residuals against the expected residuals if the residuals were normally distributed. The points on this graph should fall on the straight line (at a 45 degree angle), and if they don’t then the residuals do not appear to be normally distributed.

We can see that these points roughly fall along the 45 degree-angled line, but not strongly and therefore we can claim that this is not evidence for normally distributed residuals.

Therefore, all previous graphs support the idea that the residuals show heteroscedasticity (meaning that they are unequally scattered), hence the residuals are not normally distributed. This means we do not have significant supporting evidence that the two axes are linearly related, and therefore we can claim that the \(\log\)(PPMR) and \(\log\)(predator mass) in fact are independent of each other.

8.4 Residual plots for the individual predator species

model<- lm(lppmr~lpred_weight, data=renamed_df) 

ggplot(model, aes(x = .resid)) + 
  geom_density(colour="red") + 
  facet_wrap(~renamed_df$pred_species, scales = "free_y") +
  labs(title="Density plot of the residuals", x='Residuals', y='Density')

#Plotting the density of the residuals for each predator species

This is a histogram plot of the residuals of the plot, separated for each predator species. As discussed earlier, a normal distribution of the density of residuals would suggest that the data fits a linear model. This would then infer that the axes would be related to one another.

However, none of the predator species produce a density plot which looks convincingly normally distributed. Therefore, we can say that this data suggest that the PPMR and predator mass are independent of one another when modelling using a linear model. This also suggests that using a linear model (where we approximate \(y=mx +c\)) is not appropriate for this set of data.

Instead, we use a different model (a mixed effects model) to fit this data set to in order to attempt to more accurately approximate certain parameters in the variable relationships.

9 Introducing a mixed effects model

We will now fit the data to a mixed effects model, which allows us to account for some variance that external effects may cause to the data.

9.1 Building up the model

9.1.1 Basic Linear model

This is mathematically represented by: \[ \log(\text{PPMR})_i = \beta_1 \times \log(\text{predator mass})_i + \beta_0 \]

Where:

  • \(i\) is the index of the individual observation
  • \(\beta_{0}\) is the overall intercept term (defined at the point where \(\log(\text{predator mass}) = 0\).)
  • \(\beta_{1}\) is the slope term that’s dependent on the fixed effect (the predator mass)
renamed_df %>% 
    ggplot(aes(x=lpred_weight, y=lppmr, group=pred_species, color=pred_species)) + 
    geom_point(size=0.1) + 
    labs(x='log(predator mass)', y='log(PPMR)') +
    theme_classic()

#theme_classic() removes the grey background of the plot to make it easier to read

This plot is very difficult to interpret, and the relationship between the \(\log(\text{PPMR})\) and \(\log(\text{predator mass})\) is difficult to visualise as there is not a single gradient for each predator species. To aid in our interpretation, We will add random effects to account for some of the variance in the \(\log(\text{PPMR})\) values.

9.1.2 First trial

Fixed effects: \(\log\)(predator mass)

Random effects: year (fixed slopes and random intercepts)

one <- lmer(lppmr ~ lpred_weight + (1|year), data = renamed_df, REML=FALSE)
#same slope for every predator species, but varying intercept

year_list <- c(sort(unique(renamed_df$year)))
year_list_2000 <- year_list[c(year_list>2000)]
#a vector only including the values for year which are greater than 2000

one_df <- data.frame(Year=c(year_list),
                        "Slope"=c(coef(one)$year[2]))
one_df <- one_df %>% transmute(Year, Slope=lpred_weight) 
one_df <- one_df[one_df$Year %in% year_list_2000, ]
kable(one_df)
Year Slope
84 2001 0.1218916
85 2002 0.1218916
86 2004 0.1218916
87 2005 0.1218916
88 2006 0.1218916
89 2007 0.1218916
90 2008 0.1218916
91 2009 0.1218916
92 2010 0.1218916
93 2011 0.1218916
94 2012 0.1218916
95 2013 0.1218916
96 2015 0.1218916
97 2016 0.1218916
#This outputs the intercept term for each predator species for the fitted model, and only for values where the year is after 2000

plot_model(one, type = "re", vline.color="grey", dot.size=1.1, value.size=3,
           show.values=T, value.offset=.3) + 
        labs(x='Year', y='Value', title='Random effects: Intercepts - Model 1') + 
        xlim(factor(year_list_2000))

#Plots a model of the estimated values of the random effects variable

Here, there is the same slope for every predator species, but each species group has a different y-intercept (i.e. the line for each predator species group crosses the \(\log\)(predator mass)=0 line at a different value for \(\log\)(PPMR)). These values are only approximated, and the plot shows that

This is mathematically represented by:

\[ \log(\text{PPMR})_{iy} = (\beta_{0} + \alpha_{0y}) + \beta_{1} \log(\text{predator mass})_{i} + \epsilon_{iy} \]

Where:

  • \(i\) is the index of the individual observation
  • \(y\) is the index of the year
  • \(\beta_{0}\) is the overall intercept term
  • \(\alpha_{0y}\) is the random intercept term for year \(y\) (where \(\alpha_{0j}\) is normally distributed with a mean of \(0\) and a standard deviation \(\sigma_{A[00]}\) (i.e. \(\alpha_{0y} \sim (0, \sigma^2_{A[00]})\))).
  • \(\beta_{1}\) is the slope term that’s dependent on the fixed effect (the predator mass)
  • \(\epsilon_{iy}\) is the error term, which describes the residuals of the model

9.1.3 Second trial

Fixed effects: \(\log\)(predator mass)

Random effects: year (random slope and fixed intercept)

two <- lmer(lppmr ~ lpred_weight + (0 + lpred_weight|year), data = renamed_df, REML=FALSE)
#(0+ | ) means that the slope of the graph is different for each predator species, but they each group has the same intercept

two_df <- data.frame(Year=c(year_list),
                        "Intercept"=c(coef(two)$year[1]))
two_df <- two_df %>% transmute(Year, Intercept=X.Intercept.) 
two_df <- two_df[two_df$Year %in% year_list_2000, ]
kable(two_df)
Year Intercept
84 2001 5.553805
85 2002 5.553805
86 2004 5.553805
87 2005 5.553805
88 2006 5.553805
89 2007 5.553805
90 2008 5.553805
91 2009 5.553805
92 2010 5.553805
93 2011 5.553805
94 2012 5.553805
95 2013 5.553805
96 2015 5.553805
97 2016 5.553805
plot_model(two, type = "re", vline.color="grey", dot.size=1.1, value.size=3,
           show.values=T, value.offset=.3) + 
        labs(x='Year', y='Value', title='Random effects: Slope - Model 2') + 
        xlim(factor(year_list_2000))

This is mathematically represented by:

\[ \log(\text{PPMR})_{ij} = \beta_{0} + (\beta_{1} + \alpha_{1y}) \log(\text{predator mass})_{i} + \epsilon_{iy} \]

Where:

  • \(i\) is the index of the individual observation
  • \(y\) is the index of the year
  • \(\beta_{0}\) is the overall intercept term
  • \(\beta_{1}\) is the slope term that’s dependent on the fixed effect (the predator mass)
  • \(\alpha_{1y}\) is the slope term that’s dependent on the random effect (the year) (where \(\alpha_{1y}\) is normally distributed with a mean of \(0\) and a standard deviation \(\sigma_{A[10]}\) (i.e. \(\alpha_{1y} \sim (0, \sigma^2_{A[10]})\))).
  • \(\epsilon_{iy}\) is the error term, which describes the residuals of the model

For this plot, the value of log(predator mass) is modelled to be dependent on the year that each observation is taken. This allows us to have lines for each unique year which all have the same intercept but differing slopes.

9.1.4 Third trial

Fixed effects: \(\log\)(predator mass)

Random effects: year (random slope and random intercept)

three <- lmer(lppmr ~ lpred_weight + (1+lpred_weight|year), data = renamed_df, REML=FALSE)
#Random intercept and random slope (correlated)

plot_model(three, type = "re", vline.color="grey", dot.size=1.1, value.size=3,
           show.values=T, value.offset=.3) + 
        labs(x='Year', y='Value', title='Random effects - Model 3') + 
        xlim(factor(year_list_2000))

This is mathematically represented by: \[ \log(\text{PPMR})_{ij} = (\beta_{0} + \alpha_{0j}) + (\beta_{1} + \alpha_{1j}) \log(\text{predator mass})_{i} + \epsilon_{ij} \]

Where:

  • \(i\) is the index of the individual observation
  • \(y\) is the index of the year
  • \(\beta_{0}\) is the overall intercept term
  • \(\alpha_{0j}\) is the random intercept term for predator species \(j\) (where \(\alpha_{0y}\) is normally distributed with a mean of \(0\) and a standard deviation \(\sigma_{A[00]}\) (i.e. \(\alpha_{0y} \sim (0, \sigma^2_{A[00]})\))).
  • \(\beta_{1}\) is the slope term that’s dependent on the fixed effect (the predator mass)
  • \(\alpha_{1y}\) is the slope term that’s dependent on the random effect (the year)
  • \(\epsilon_{iy}\) is the error term, which describes the residuals of the model

From the table, we can see this model features slopes and intercepts that are different for each year are different.

9.1.5 Comparing the models

anova(one,two,three)
## Data: renamed_df
## Models:
## one: lppmr ~ lpred_weight + (1 | year)
## two: lppmr ~ lpred_weight + (0 + lpred_weight | year)
## three: lppmr ~ lpred_weight + (1 + lpred_weight | year)
##       npar     AIC     BIC  logLik deviance Chisq Df Pr(>Chisq)    
## one      4 1148144 1148186 -574068  1148136                        
## two      4 1157243 1157285 -578618  1157235     0  0               
## three    6 1140374 1140437 -570181  1140362 16873  2  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#comparing the three models

We can see that the third trial has a significantly smaller BIC value, and hence we will choose to continue using models which have slopes and intercepts that vary according to the fixed effect.

9.2 Trialing different models - only for Mackerel

Here, all the random effects are modeled with a randomly distributed slope and intercept for each level of a grouping.

9.2.1 With ship name

mackerel_df <-  renamed_df %>% filter(pred_species == fixed("Mackerel"))

a <- lmer(lppmr ~ lpred_weight + (1+lpred_weight|haul_id_short), data = mackerel_df, REML=FALSE)

summary(a)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: lppmr ~ lpred_weight + (1 + lpred_weight | haul_id_short)
##    Data: mackerel_df
## 
##      AIC      BIC   logLik deviance df.resid 
## 129180.0 129228.8 -64584.0 129168.0    25067 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.3883 -0.8476 -0.0531  0.8054  3.2284 
## 
## Random effects:
##  Groups        Name         Variance Std.Dev. Corr 
##  haul_id_short (Intercept)   5.67578 2.3824        
##                lpred_weight  0.06209 0.2492   -0.03
##  Residual                   10.07684 3.1744        
## Number of obs: 25073, groups:  haul_id_short, 17
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept)    5.6113     0.7486   7.496
## lpred_weight   0.5082     0.1036   4.907
## 
## Correlation of Fixed Effects:
##             (Intr)
## lpred_weght -0.482
## optimizer (nloptwrap) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.00396204 (tol = 0.002, component 1)

This is mathematically represented by:

\[ \log(\text{PPMR})_{ih} = (\beta_{0} + H_{0h}) + (\beta_{1} + H_{1h}) \log(\text{predator mass})_{i} + \epsilon_{ih} \]

Where:

  • \(i\) is the index of the individual observation
  • \(h\) is the index of the different haul_id name
  • \(\beta_{0}\) is the overall intercept term
  • \(H_{0h}\) is the random intercept term for haul_id \(h\) (where \(H_{0h}\) is normally distributed with a mean of \(0\) and a standard deviation \(\sigma_{H[00]}\) (i.e. \(H_{0h} \sim (0, \sigma^2_{H[00]})\))).
  • \(\beta_{1}\) is the slope term that’s dependent on the fixed effect (the predator mass)
  • \(H_{1h}\) is the slope term that’s dependent on the random effect (the haul_id)
  • \(\epsilon_{ih}\) is the error term, which describes the residuals of each point

And:

  • Fixed effects: \(\log\)(predator mass)
  • Random effects: ship name
mackerel_df %>% 
  ggplot(aes(x = resid(a))) +
  geom_density(colour="blue") +
  labs(title='Density plot of the residuals for model a: Mackerel', x='Residuals', y='Density')

#resid(a) gives the residuals of model a
#The residuals are found by subtracting the fitted from the predicted values

We can plot the density of the residuals of this model. Normally distributed residuals suggests that the residuals have a constant variance and hence the axes are related in some way.

qqnorm(resid(a), xlab = "Theoretical Residuals", ylab = "Sample Residuals",
       main="Normal Q-Q plot of residuals - mixed effect model with ship name")
qqline(resid(a), col="blue")

We have also plotted a qqplot which shows the actual residuals (calculated from the data set) plotted against the expected residuals if they were completely normally distributed. If the residuals are perfectly normally distributed, then the points should fall on the straight line which is at a 45 degree angle.

We can see that the points fall roughly on the straight line but not completely, and therefore the data set does not very accurately fit to the model.

9.2.2 With year

b <- lmer(lppmr ~ lpred_weight + (1+lpred_weight|haul_id_short) + (1+lpred_weight|year), 
          data = mackerel_df, REML=FALSE)

summary(b)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: lppmr ~ lpred_weight + (1 + lpred_weight | haul_id_short) + (1 +  
##     lpred_weight | year)
##    Data: mackerel_df
## 
##      AIC      BIC   logLik deviance df.resid 
## 129178.9 129252.1 -64580.5 129160.9    25064 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.3848 -0.8478 -0.0527  0.8037  3.2287 
## 
## Random effects:
##  Groups        Name         Variance Std.Dev. Corr 
##  haul_id_short (Intercept)   2.85041 1.6883        
##                lpred_weight  0.03410 0.1847   1.00 
##  year          (Intercept)   2.29299 1.5143        
##                lpred_weight  0.04607 0.2146   -1.00
##  Residual                   10.07302 3.1738        
## Number of obs: 25073, groups:  haul_id_short, 17; year, 15
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept)    5.3859     0.7748   6.952
## lpred_weight   0.5352     0.1157   4.626
## 
## Correlation of Fixed Effects:
##             (Intr)
## lpred_weght -0.529
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')

Mathematical representation \[ \log(\text{PPMR})_{ihy} = (\beta_{0} + H_{0h} + Y_{0y}) + (\beta_{1} + H_{1h} + Y_{1y}) \log(\text{predator mass})_{i} + \epsilon_{ihyr} \]

Where:

  • \(y\) is the index of the different year the observation was taken
  • \(Y_{0y}\) is the random intercept term for haul_id \(h\) (where \(Y_{0y}\) is normally distributed with a mean of \(0\) and a standard deviation \(\sigma_{Y[00]}\) (i.e. \(Y_{0y} \sim (0, \sigma^2_{Y[00]})\))).
  • \(Y_{1y}\) is the slope term that’s dependent on the random effect (the haul_id)

And:

  • Fixed effects: \(\log\)(predator mass)
  • Random effects: ship name, year
qqnorm(resid(b), xlab = "Theoretical Residuals", ylab = "Sample Residuals",
       main="Normal Q-Q plot of residuals - mixed effect model with ship name and year")
qqline(resid(b), col="blue")

mackerel_df %>% 
  ggplot(aes(x = resid(b))) +
  geom_density(colour="red") +
  labs(title='Density plot of the residuals for model b: Mackerel', x='Residuals', y='Density')

9.2.3 With location

c <- lmer(lppmr ~ lpred_weight + (1+lpred_weight|haul_id_short) + (1+lpred_weight|year) +
            (1+lpred_weight|ices_rectangle), data = mackerel_df, REML=FALSE)

summary(c)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: lppmr ~ lpred_weight + (1 + lpred_weight | haul_id_short) + (1 +  
##     lpred_weight | year) + (1 + lpred_weight | ices_rectangle)
##    Data: mackerel_df
## 
##      AIC      BIC   logLik deviance df.resid 
## 126546.3 126643.9 -63261.2 126522.3    25061 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.7025 -0.7479 -0.0034  0.7178  3.1882 
## 
## Random effects:
##  Groups         Name         Variance  Std.Dev. Corr 
##  ices_rectangle (Intercept)  21.036719 4.58658       
##                 lpred_weight  0.618101 0.78619  -0.97
##  haul_id_short  (Intercept)  10.066727 3.17281       
##                 lpred_weight  0.004668 0.06832  -1.00
##  year           (Intercept)   3.845649 1.96103       
##                 lpred_weight  0.075033 0.27392  -1.00
##  Residual                     8.790453 2.96487       
## Number of obs: 25073, groups:  ices_rectangle, 364; haul_id_short, 17; year, 15
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept)    4.8912     1.2227   4.000
## lpred_weight   0.5958     0.1493   3.992
## 
## Correlation of Fixed Effects:
##             (Intr)
## lpred_weght -0.815
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')

Mathematical representation: \[ \log(\text{PPMR})_{ihyr} = (\beta_{0} + H_{0h} + Y_{0y} + R_{0r} )+ (\beta_{1} + H_{1h} + Y_{1y} + R_{1r}) \log(\text{predator mass})_{i} + \epsilon_{ihyr} \]

Where:

  • \(r\) is the index of the different ices_rectangle term
  • \(R_{0r}\) is the random intercept term for the ices_rectangle \(r\) (where \(R_{0r}\) is normally distributed with a mean of \(0\) and a standard deviation \(\sigma_{R[00]}\) (i.e. \(R_{0r} \sim (0, \sigma^2_{R[00]})\))).
  • \(R_{1r}\) is the slope term that’s dependent on the random effect (the haul_id)

And:

  • Fixed effects: \(\log\)(predator mass)
  • Random effects: ship name, year, ices_rectangle
qqnorm(resid(c), xlab = "Theoretical Residuals", ylab = "Sample Residuals",
       main="Normal Q-Q plot of residuals - mixed effect model with ship name, 
              year, and location")
qqline(resid(c), col="blue")

mackerel_df %>% 
  ggplot(aes(x = resid(c))) +
  geom_density(colour="green") +
  labs(title='Density plot of the residuals for model c: Mackerel', x='Residuals', y='Density')

9.2.4 Comparing

ggplot(data=mackerel_df) +
  labs(title='Density plot of the residuals: Mackerel', x='Residuals', y='Density', 
       color="Model", linetype="Model") +
  geom_density(aes(x=resid(a), colour="with ship name", linetype="with ship name")) +
  geom_density(aes(x=resid(b), colour="with ship name and year", 
                  linetype="with ship name and year")) +
  geom_density(aes(x=resid(c), colour="with ship name, year and location", 
                   linetype="with ship name, year and location")) +
  scale_color_manual(values=c("blue", "red", "green")) +
  scale_linetype_manual(values = c("dashed", "dotted", "dotdash")) +
  theme_classic()

anova(a,b,c)
## Data: mackerel_df
## Models:
## a: lppmr ~ lpred_weight + (1 + lpred_weight | haul_id_short)
## b: lppmr ~ lpred_weight + (1 + lpred_weight | haul_id_short) + (1 + lpred_weight | year)
## c: lppmr ~ lpred_weight + (1 + lpred_weight | haul_id_short) + (1 + lpred_weight | year) + (1 + lpred_weight | ices_rectangle)
##   npar    AIC    BIC logLik deviance     Chisq Df Pr(>Chisq)    
## a    6 129180 129229 -64584   129168                            
## b    9 129179 129252 -64580   129161    7.1001  3    0.06878 .  
## c   12 126546 126644 -63261   126522 2638.5863  3    < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#comparing the three models

The BIC value is the smallest when all three random effects are considered, and the residuals for this three-random-effect model look the most similar to a normal distribution. Therefore, we will choose to use a model which uses all three random effects for further analysis.

9.3 Mixed effects model - over all species

The third model (with ship name, year and location as random effects) appears to represent the data the most effectively. We now want to consider a mixed effects model over all the predator species (not just Herring). Therefore, we choose to continue with model c (with random effects:ship name, year, ices_rectangle), but we also introduce a fourth random effect of predator species.

 final <- lmer(lppmr ~ lpred_weight + pred_species + 
            (1+lpred_weight|haul_id_short) + (1+lpred_weight|year) +
            (1+lpred_weight|ices_rectangle), 
            data = renamed_df, REML=FALSE)

summary(final)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: 
## lppmr ~ lpred_weight + pred_species + (1 + lpred_weight | haul_id_short) +  
##     (1 + lpred_weight | year) + (1 + lpred_weight | ices_rectangle)
##    Data: renamed_df
## 
##       AIC       BIC    logLik  deviance  df.resid 
## 1050849.1 1051131.1 -525397.5 1050795.1    254568 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -5.7053 -0.6066 -0.0367  0.5777  6.7409 
## 
## Random effects:
##  Groups         Name         Variance Std.Dev. Corr 
##  ices_rectangle (Intercept)  0.79813  0.8934        
##                 lpred_weight 0.02387  0.1545   -0.78
##  year           (Intercept)  0.79693  0.8927        
##                 lpred_weight 0.01425  0.1194   -0.79
##  haul_id_short  (Intercept)  2.63966  1.6247        
##                 lpred_weight 0.06293  0.2509   -0.44
##  Residual                    3.58887  1.8944        
## Number of obs: 254595, groups:  
## ices_rectangle, 718; year, 97; haul_id_short, 97
## 
## Fixed effects:
##                              Estimate Std. Error t value
## (Intercept)                 5.0363484  0.2580857  19.514
## lpred_weight                0.2737651  0.0377947   7.243
## pred_speciesCod            -0.2123354  0.1259732  -1.686
## pred_speciesCommon Dab     -0.3095996  0.1419434  -2.181
## pred_speciesEuropean Hake  -1.4461576  0.1425841 -10.142
## pred_speciesHaddock         0.5266592  0.1270956   4.144
## pred_speciesHerring         1.0340250  0.1207008   8.567
## pred_speciesHorse Mackerel  1.1556948  0.2618773   4.413
## pred_speciesMackerel        1.0736578  0.1290907   8.317
## pred_speciesMegrim         -0.7387652  0.1448988  -5.098
## pred_speciesMonkfish       -1.9261290  0.1622835 -11.869
## pred_speciesNorway Pout     0.8562367  0.1738728   4.924
## pred_speciesPlaice          0.7833358  0.1361487   5.754
## pred_speciesPoor Cod       -0.0006363  0.1552063  -0.004
## pred_speciesSole            0.6118634  0.2000691   3.058
## pred_speciesSprat           0.9023672  0.1657724   5.443
## pred_speciesWhiting        -0.4030091  0.1256431  -3.208
## optimizer (nloptwrap) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.0264106 (tol = 0.002, component 1)

Mathematical representation: \[ \log(\text{PPMR})_{ihyrp} = (\beta_{0} + H_{0h} + Y_{0y} + R_{0r}) + (\beta_{1} + H_{1h} + Y_{1y} + R_{1r}) \log(\text{predator mass})_i + \\ \sum_{p=1}^{16} \alpha_p (\text{predator species})_p + \epsilon_{ihyr} \] Where:

  • \(i\) is the index of the individual observation
  • \(h\) is the index of the different ship name
  • \(y\) is the index of the the year each observation is taken
  • \(r\) is the index of the different ices_rectangle term
  • \(p\) is the index of the predator species the observation belongs to
  • \(\beta_{0}\) is the overall intercept term
  • \(H_{0h}\) is the random intercept term for haul_id, \(h\)
  • \(Y_{0y}\) is the random intercept term for year, \(y\)
  • \(R_{0r}\) is the random intercept term for ices_rectangle, \(r\)
  • \(\beta_{1}\) is the slope term that’s dependent on the fixed effect (the predator mass)
  • \(H_{1h}\) is the slope term that’s dependent on the haul_id
  • \(Y_{1y}\) is the slope term that’s dependent on the year
  • \(R_{1r}\) is the slope term that’s dependent on the ices_rectangle
  • \(\alpha_p\) is the slope term which is dependent on the predator species (and each predator species \(p\) has a unique slope term)
  • \(\epsilon_{ihyr}\) is the error term, which describes the residuals of each point

(where each random intercept term and the residuals are normally distributed with a mean of \(0\) and a standard deviation \(\sigma_{[00]}\) (i.e. \(sim (0, \sigma^2_{[00]})\))).

and,

\[ \text{species} = \{\text{Blue Whiting}, \text{ Cod}, \text{ Common Dab}, \text{ European Hake}, \ldots, \text{ Whiting}\} \]

is an array containing all possible predator species types, so:

\[ \text{pred}_p = \begin{cases} 1, \text{if the species that observation i belongs to is } \text{species}[p] \\ 0, \text{ otherwise} \end{cases} \]

So,

  • Fixed effects: \(\log\)(predator mass), predator species
  • Random effects: ship name, year, ices_rectangle

This model has the following variables:

Variable name Meaning Value
\(\beta_{0}\) the intercept term 5.0363484
\(\beta_{1}\) the fixed effect (slope) for \(\log(\text{predator mass})\) 0.2737651
\(\sigma_{H[0h]}\) s.d. for the intercept term \(H_{0h}\), dependent on the haul_id_short 1.6247
\(\sigma_{H[1h]}\) s.d. for the slope term \(H_{1h}\), dependent on the haul_id_short 0.2509
\(\sigma_{Y[0y]}\) s.d. for the intercept term \(Y_{0y}\), dependent on the year 1.0074
\(\sigma_{Y[1y]}\) s.d. for the slope term \(Y_{1y}\), dependent on the year 0.1194
\(\sigma_{R[0r]}\) s.d. for the intercept term \(R_{0r}\), dependent on the ices_rectangle 0.8934
\(\sigma_{R[1r]}\) s.d. for the slope term \(R_{1r}\), dependent on the ices_rectangle 0.1545
\(\sigma_{\epsilon[ihyr]}\) the s.d. of the residual/error term 1.8944

and the fixed effects coefficients are:

Predator Species Estimated gradient (\(\alpha_p\))
Blue Whiting
Cod -0.2123354446
Common Dab -0.3095996172
European Hake -1.4461576412
Haddock 0.5266592309
Herring 1.0340249983
Horse Mackerel 1.1556948335
Mackerel 1.0736578125
Megrim -0.7387651598
Monkfish -1.9261290331
Norway Pout 0.8562367326
Plaice 0.7833357976
Poor Cod -0.0006362586
Sole 0.6118634430
Sprat 0.9023672482
Whiting -0.4030090827

BLUE WHITING??

qqnorm(resid(final),xlab = "Theoretical Residuals", ylab = "Sample Residuals",
       main="Normal Q-Q plot of residuals - final mixed effect model")
qqline(resid(final), col="blue")

???

renamed_df %>% 
  ggplot(aes(x = resid(final))) +
  geom_density(aes(linetype="line", color="line")) +
  labs(title='Density plot of the residuals', x='Residuals', y='Density', 
              color="Line", linetype="Line") +
  stat_function(fun = dnorm, aes(linetype = "normal", color="normal"),
              args= with(final, c(mean = mean(resid(final)), sd(resid(final))))) +
  scale_color_manual(values=c('red', 'black')) +
  scale_linetype_manual(values = c("solid", "dashed"))

???

9.4 Separated by predator species - Density of residual

Separating the density of residuals plot out over the predator species gives:

renamed_df %>% 
  ggplot(aes(x = resid(final), color=pred_species)) +
  geom_density() +
  labs(title='Density plot of the residuals - separated over species',
              x='Residuals', y='Density') +
  facet_wrap(~pred_species) +
  theme(legend.position="none")

?????

9.5 Comparisons for Mackerel

We now look at specific plots for Mackerel which allow us to make some conclusions about the feeding preferences of the species Mackerel.

mackerel_df <-  renamed_df %>% filter(pred_species == fixed("Mackerel"))

9.5.1 Distribution of prey types

ggplot(mackerel_df, aes(x=prey_type, fill=prey_type)) +
   geom_bar_pattern(stat = "count",
                    pattern_color = "white",
                    pattern_fill = "black",
                    aes(pattern = prey_type, pattern_angle = prey_type)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position="none") +
  labs(x="Prey type", y="Number of Observations")

Mackerel mostly eat Zooplankton and prey belonging to the group `other’.

9.5.2 Density of lppmr observations

ggplot(mackerel_df, aes(x=lppmr)) +
  geom_density(aes(weight = no._prey_per_stmch), color="lightseagreen") +
  labs(x="log(PPMR)", y="Number density of observations")

Mackerel have a wide distribution of \(\log\)(PPMR) values, suggesting that there is a wide range of prey masses consumed by predator.

9.5.3 Density plot of the residuals

mackerel_df %>% 
  ggplot(aes(x = resid(c))) +
  geom_density(colour="lightseagreen") +
  labs(x='Residuals', y='Density') +
  theme_classic()

The residuals are fairly widely distributed. This means that the observed points of the fitted \(\log\)(PPMR) against \(\log\)(predator mass) plot are very spread out, and therefore there is a wide distribution of \(\log\)(PPMR) values across the species Mackerel. This then suggests that a predator such as Mackerel feeds on a wide distribution of prey masses.